Disclaimer: some of this is quite specific to coding in R
Why This is Important
Good coding practices are important because they help ensure that your code is:
readable - is the code written in a clear and readable way?
understandable - is the code easy to follow? Is it clear what is being done at each stage?
reproducible - would another person be able to re-run the code?
This is useful for any type of coding project. On solo projects, your code may make perfect sense while you are writing it but not when you look back at it later. When working with others, the code needs to make sense not only to you but also to your collaborators.
Introducing some of these practices into your workflow at the start of projects can help them run more smoothly, avoid potential confusion and save time further down the line.
There is a move for science to become more reproducible, with more researchers making their code and data available when publishing a paper. This means your code may be viewed by a wider range of people, so it is even more important that the code is reproducible.
Note the exception: do not put a space between a function name and its opening bracket
# like this
data$patient_age_grp <- if_else(data$patient_age <= 55, 0, 1)

# not like this
data$patient_age_grp <- if_else (data$patient_age <= 55, 0, 1)
Readable Code
Avoid lines that are too long
# bad
ggplot(data) + geom_point(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) + theme_minimal() + labs(title = "Age and length of stay of patients at 10 hospital trusts", x = "Patient Age (years)", y = "Patient Length of Stay (Days)")

# better
ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay,
                 colour = as.factor(death_flag))) +
  theme_minimal() +
  labs(title = "Age and length of stay of patients at 10 hospital trusts",
       x = "Patient Age (years)",
       y = "Patient Length of Stay (Days)")
If using the tidyverse or ggplot2 then start a new line after each %>% or +
data %>%
  filter(organisation_name == "Trust1") %>%
  ggplot(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) +
  geom_point() +
  theme_minimal()
Use functions to avoid repeating lines of code. A general rule of thumb is that if you copy and paste a section of code more than twice then you should write a function.
Chapter 25 of the R4DS book is a useful place to start
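As a minimal sketch of this rule of thumb, the age-grouping step shown earlier could be wrapped in a function. The function name `age_group`, its `cutoff` argument and the small dataset are hypothetical, and base `ifelse()` is used in place of `if_else()` so that the example runs without loading any packages.

```r
# Hypothetical helper: returns 0 for ages up to the cutoff and 1 above it
age_group <- function(age, cutoff = 55) {
  ifelse(age <= cutoff, 0, 1)
}

# Made-up data for illustration only
data <- data.frame(patient_age = c(23, 55, 56, 80))

# The same logic is reused with different cut-offs, with no copy and paste
data$patient_age_grp <- age_group(data$patient_age)
data$patient_age_grp_65 <- age_group(data$patient_age, cutoff = 65)
```

If the grouping rule changes later, it only needs to be updated in one place.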
Use comments to annotate your code so it is easier to follow, particularly to document WHY you have done something
# Anything preceded by a # will not be executed by R
# 10*15
10*20
[1] 200
# At the start of an R script I usually write a few lines describing what the script is for,
# what the input data is and what the expected outputs are.
# Then use comments throughout to break up the code and explain the analysis. Example:

# 15 patients in dataset with their age missing, excluding them from analysis
data <- data %>%
  filter(!is.na(patient_age))
# n = 285 from here
Naming Things
When you are naming new variables or functions choose names that are descriptive. Do not duplicate names
# bad
model_a <- glm(data$patient_age ~ data$length_of_stay, family = gaussian())
model_b <- glm(as.factor(data$death_flag) ~ data$patient_age, family = binomial())

# better
model_los_age <- glm(data$length_of_stay ~ data$patient_age, family = gaussian())
model_death_age <- glm(as.factor(data$death_flag) ~ data$patient_age, family = binomial())
Use descriptive names for files too. If working on a larger project, consider having a separate file for each stage of the project, and make it clear what order the analysis has been done in.
For example:
01_data_cleaning.R
02_baseline_characteristics.R
03_descriptive_stats.R
04_models.R
05_figures.R
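One reason zero-padded numeric prefixes work well is that an alphabetical listing of the files then matches the order of the analysis. A small sketch, using the example file names above in a deliberately jumbled order:

```r
# File names from the example above, deliberately out of order
scripts <- c("04_models.R", "01_data_cleaning.R", "05_figures.R",
             "03_descriptive_stats.R", "02_baseline_characteristics.R")

# Alphabetical sorting recovers the intended analysis order
sort(scripts)
```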
Organising Your Work
Within an R script you can use sections to organise your scripts.
Insert a new section using ctrl + shift + R and navigate using the document outline on the right of the script
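As a rough illustration, RStudio treats a comment line that ends in four or more dashes as a section header, and these headers are what appear in the document outline. The section names and the small dataset below are made up for the example.

```r
# Load data ----
# Made-up data for illustration only
data <- data.frame(patient_age = c(34, NA, 61),
                   length_of_stay = c(2, 5, 9))

# Clean data ----
# Exclude patients with missing age
data <- data[!is.na(data$patient_age), ]

# Summarise ----
mean_los <- mean(data$length_of_stay)
```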
Working within an R Project is a good way to organise your R scripts and to keep all the data and outputs from your work in the same place.
It avoids the need to use setwd() at the start of your scripts, which is not best practice, particularly when collaborating with others.
setwd() is typically given an absolute file path, e.g.
setwd("C:/Users/mfbx9sbk/OneDrive - The University of Manchester/MSc Teaching/coding_best_practice_2")
This can cause problems when you are collaborating with others, as not everyone will have their files organised in the same way.
R Projects use relative file paths, which are relative to the working directory of the project.
For example, you want to save a cleaned version of your data, or a plot you have generated.
Here the file paths are relative to the Project directory
ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay))
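As a rough sketch of saving outputs with relative paths: the folder names `data` and `figures`, the file names and the small dataset below are all hypothetical, and base R functions are used so that no packages are needed. Inside an R Project the working directory is the project folder, so these paths resolve the same way on any collaborator's machine.

```r
# Hypothetical project layout: data/ and figures/ folders in the project directory
dir.create("data", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)

# Made-up dataset standing in for the cleaned data
cleaned_data <- data.frame(
  patient_age = c(34, 61, 47),
  length_of_stay = c(2, 9, 4)
)

# Relative path: resolved against the working directory (the project root)
write.csv(cleaned_data, file.path("data", "cleaned_data.csv"), row.names = FALSE)

# Save a plot to a relative path using base graphics
pdf(file.path("figures", "age_vs_los.pdf"))
plot(cleaned_data$patient_age, cleaned_data$length_of_stay,
     xlab = "Patient Age (years)", ylab = "Patient Length of Stay (Days)")
dev.off()
```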
If you have ever had a bunch of files that look something like this then you may want to consider using a version control system to manage your projects
Version Control
Using a version control system can:
help organise your work and keep track of updates and changes
make it easier to collaborate with others
create a repository that can be shared more widely when a project is complete
be difficult to navigate at first but quickly become integrated into your regular workflow
The most widely used software for version control in the data science community is Git. Git takes a snapshot of all files in a project at a specific time, referred to as a “commit”. It stores the initial version and any subsequent updated versions that are committed. It tracks the changes you have made at each commit, which can be viewed using the “diff” command.
GitHub is a complementary hosting platform for your repositories (others are available). Once updates have been committed to Git they can be “pushed” to GitHub. Collaborators can then “fork” or “clone” a copy of the repository and work on it locally while you are still working on it, with everyone pushing commits to and pulling commits from GitHub.
What a repository looks like on GitHub
I would recommend reading this article, which explains in more detail how to use Git and GitHub.
Git can be integrated into RStudio and so be incorporated more easily into your workflow. Once Git is installed, an additional tab will appear in the environment pane, where you can commit and push files.
Or you can go into the RStudio terminal tab and type Git commands from there
To install Git and connect it to GitHub and RStudio, follow this tutorial by Jenny Bryan. It talks through each setup step and how to use basic Git commands.